dgit.raspbian.org Git

x86: allow Dom0 read-only access to IO-APICs

There are BIOSes that want to map the IO-APIC MMIO region from some
ACPI method(s), and there is at least one BIOS flavor that wants to
use this mapping to clear an RTE's mask bit. While we can't allow the
latter, we can permit reads and simply drop write attempts, leveraging
the already existing infrastructure introduced for dealing with AMD
IOMMUs' representation as PCI devices.

This fixes an interrupt setup problem on a system where _CRS evaluation
involved the above described BIOS/ACPI behavior, and is expected to
also deal with a boot time crash of pv-ops Linux upon encountering the
same kind of system.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

x86: make page table handling error paths preemptible

... as they may take significant amounts of time.

This requires cloning the tweaked continuation logic from
do_mmuext_op() to do_mmu_update().

Note that in mod_l[34]_entry() a negative "preemptible" value gets
passed to put_page_from_l[34]e() now, telling the callee to store the
respective page in current->arch.old_guest_table (for a hypercall
continuation to pick up), rather than carrying out the put right away.
This is going to be made a little more explicit by a subsequent cleanup
patch.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make page table unpinning preemptible

... as it may take significant amounts of time.

Since we can't re-invoke the operation in a second attempt, the
continuation logic must be slightly tweaked so that we make sure
do_mmuext_op() gets run one more time even when the preempted unpin
operation was the last one in a batch.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make arch_set_info_guest() preemptible

.. as the root page table validation (and the dropping of an eventual
old one) can require meaningful amounts of time.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make vcpu_reset() preemptible

... as dropping the old page tables may take significant amounts of
time.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make MMUEXT_NEW_USER_BASEPTR preemptible

... as it may take significant amounts of time.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make new_guest_cr3() preemptible

... as it may take significant amounts of time.

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: make vcpu_destroy_pagetables() preemptible

... as it may take significant amounts of time.

The function, being moved to mm.c as the better home for it anyway, and
to avoid having to make a new helper function there non-static, is
given a "preemptible" parameter temporarily (until, in a subsequent
patch, its other caller is also being made capable of dealing with
preemption).

This is part of CVE-2013-1918 / XSA-45.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

x86: call unmap_vcpu_info() regardless of guest type

This fixes a regression from 63753b3e ("x86: allow
VCPUOP_register_vcpu_info to work again on PVHVM guests").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>

libxl: unconst the event argument to the event_occurs hook.

The event is supposed to become owned, and therefore freed, by the application
and the const prevents this.

Unfortunately there is no way to remove the const without breaking existing
callers. The best we can do is use the LIBXL_API_VERSION provisions to remove
the const for callers who wish only to support the 4.3 API and newer.

Callers who wish to support 4.2 will need to live with casting away the const.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Jim Fehlig <jfehlig@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xen/arm: nuke some stray hard tabs.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xenstore: create pidfile in init-xenstore-domain

Since libxl checks for the existance of /var/run/xenstored.pid in order
to ensure xenstore is running, create this file when starting the
xenstore stub domain. This also changes the Makefile to enable the
creation of the init-xenstore-domain tool during tools compilation,
since the existing Makefile incorrectly added to the ALL_TARGETS list
when compiling the stubdom, when this variable is not used.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

MAINTAINERS: add Samuel as stubdom and mini-os maintainer

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

libxl: adjust point of backend name resolution

Resolution of a backend name to a domid needs to happen a little earlier
in some cases.

For example, if a domU is specified as a backend for a
disk and, as previously written, libxl__device_disk_setdefault() calls
libxl__resolve_domid() last, then disk->backend_domid still equals
LIBXL_TOOLSTACK_DOMID when libxl__device_disk_set_backend() is called.
This results in libxl__device_disk_set_backend() making an incorrect
attempt to validate the target by calling stat() on a file on dom0,
resulting in ERROR_INVAL (see libxl_device.c lines 239-248), which
prevents creation of the frontend domain.

Likewise, libxl__device_nic_setdefault() previously made use of
nic->backend_domid before it was set.

Signed-off-by: Eric Shelton <eshelton@pobox.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

xen: arm: correct platform detection in public header.

These headers cannot use the CONFIG_FOO defines provided when building Xen
(since they aren't provided when building tools or by external components) and
need to use the compiler provided architecture defines.

This manifested itself as a failure to build xenctx.c on ARM64 due to the
missing symbols contains .

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

More emacs local variable block fixes.

The emacs variable to set the C style from a local variable block is
c-file-style, not c-set-style.

These were either missed by 82639998a5f2 or have crept back in since.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

xen: introduce vcpu_block

Rename do_block to vcpu_block.

Move the call to local_event_delivery_enable out of vcpu_block, to a new
static function called vcpu_block_enable_events.

Use vcpu_block_enable_events instead of do_block throughout in
schedule.c

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

xen/arm: move the tlb_flush in create_p2m_entries to the end of the function

Move the flush after the pagetable entry has actually been written to
avoid races with other vcpus refreshing the same entriy.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xen/arm: do not call __cpu_disable on machine_halt

__cpu_disable shouldn't be called on machine_halt, in fact it cannot
succeed: cpu_disable_scheduler won't be able to migrate away vcpus to
others pcpus.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: fix spelling of "backend-id" for vtpm

Signed-off-by: Marek Marczykowski <marmarek@invisiblethingslab.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xen/arm: correct the computation of the number of interrupt lines for the GIC

In the GIC manual, the number of interrupt lines is computed with the
following formula: 32(N + 1) where N is the value retrieved from GICD_TYPER.

Without the +1 Xen doesn't initialize the last 32 interrupts and can get
garbage on these registers.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xsm: fix printf format string for strlen result

strlen returns size_t:

policydb.c: In function \91policydb_read\92:
policydb.c:1779: error: format \91%lu\92 expects type \91long unsigned int\92, but argument 3 has type \91size_t\92

This is probably benign on 64-bit x86 but was found by Dharshini on 32-bit Xen
4.2.x. I expect it affects ARM too.

Reported-by: Dharshini Tharmaraj <dharshinitharmaraj@gmail.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

x86/HVM: move per-vendor function tables into .init.data

hvm_enable() copies the table contents rather than storing the pointer,
so there's no need to keep these tables post-boot.

Also constify the return values of the per-vendor initialization
functions, making clear that once the per-vendor initialization is
complete, the vendor specific tables won't get modified anymore.

Finally, in hvm_enable(), use the returned pointer for all read
accesses as being more efficient than global variable accesses. Writes
of course still need to go to the global variable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86/EFI: fix runtime call status for compat mode Dom0

The top two bits (indicating error/warning classification) need to
remain the top two bits.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>

x86/vMCE: bugfix of vmce injection

uint16_t is not suitable to store VMCE_INJECT_BROADCAST (which is
defined as -1).

Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>

libxl: stat the path for all non-qdisk backends (including unknown)

The commit a8a1f236a296 "libxl: Only call stat() when adding a disk if we
expect a device to exist." changed things to only stat the file when the phy
backend was explicitly requested. This broke the case where we are probing and
would normally be able to decide on the phy option.

Since the intention of that commit was to allow for backends with no explicit
file in dom0 (i.e. network remote backend such as ceph) the lowest impact fix
appears to be to make that explicit. It turns out that tap disk can also
potentially handle such paths.

The only backend which requires a local file/device is PHY but we need to
handle UNKNOWN too in order for subsequent probing to work. Note that it is
not possible to autoprobe the backend if the path is not a local object, so we
don't need to worry about autoprobing ceph etc.

This should probably be revisited to rationalize the probing.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>

xen/arm: nr_lrs should be static

nr_lrs is only used in gic.c

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

xen/arm: Fix return value when write is ignored in VGIC

If a write is ignored, the function should return success.

Currently Xen will throw a data abort exception if the write in VGIC is
ignored.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: write IO ABI for disk frontends

This is a patch to forward-port a Xend behaviour. Xend writes IO ABI used for
all frontends. Blkfront before 2.6.26 relies on this behaviour otherwise guest
cannot boot when running in 32-on-64 mode. Blkfront after 2.6.26 writes that
node itself, in which case it's just an overwrite to an existing node which
should be OK.

In fact Xend writes the ABI for all frontends including console and vif. But
nowadays only old disk frontends rely on that behaviour so that we only write
the ABI for disk frontends in libxl, minimizing the impact.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

arm64: Fix compilation error with EARLY_PRINTK disabled

arm64/head.S: Assembler messages:
arm64/head.S:391: Error: operand 1 should be an integer register -- `mov pc,lr'

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

xenctx: Support arm64 and actually implement output for 32 and 64 bit

A bit basic and fuggly but a start.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xenctx: remove trailing whitespace

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xenctx: remove remnants of ia64 support

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

arm: mark vcpus as initialised when they have been

I noticed this because XEN_DOMCTL_getvcpucontext won't return anything for a
VCPU which isn't initialised.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

arm: allocate per-PCPU domheap pagetable pages

the domheap mappings are supposed to be per-PCPU. Therefore xen_pgtable
becomes a per-PCPU variable and we allocate and setup the page tables for each
secondary PCPU just before we tell it to come up.

Each secondary PCPU starts out on the boot page table but switches to its own
page tables ASAP.

The boot PCPU uses the boot pagetables as its own.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: TIm Deegan <tim@xen.org>

arm: add build time asserts for various virtual address aligment constraints

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

arm: parenthesise argument to *_linear_offset macros

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

xen: arm: rename xen_pgtable to boot_pgtable

The intention is that in a subsequent patch each PCPU will have its own
pagetables and that xen_pgtable will become a per-cpu variable. The boot
pagetables will become the boot cpu's pagetables.

For now leave a #define in place for those places which semantically do mean
xen_pgtable and not boot_pgtable.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>

install qemu into the location specified via configure --prefix.

Install qemu into the location specified via configure --prefix.
You will notice when you use something else than /usr/local.

Signed-off-by: Christoph Egger <chegger@amazon.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/pygrub: Fix install when $(BINDIR) and $(PRIVATE_BINDIR) are the same

Do not override pygrub with a symbolic link in this case.

Signed-off-by: Christoph Egger <chegger@amazon.de>
Reviewed-by: Matt Wilson <msw@amazon.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- reworded summary to fit on one line ]

tools/xenbackendd: make 'gmake clean' properly cleaning

tools/xenbackendd: properly cleanup
Do not leave builds on gmake clean.

Signed-off-by: Christoph Egger <chegger@amazon.de>
Reviewed-by: Matt Wilson <msw@amazon.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: Only call stat() when adding a disk if we expect a device to exist.

We consider calling stat() a helpful error check in the following
circumstances only:
1. the disk backend type must be PHYsical
2. the disk backend domain must be the same as the running libxl
    code (ie LIBXL_TOOLSTACK_DOMID)
3. there must not be a hotplug script because this would imply that
    the device won't be created until after the hotplug script has
    run.

With this fix, it is possible to use qemu's built-in block drivers
such as ceph/rbd, with a xl config disk spec like this:

disk=[ 'backendtype=qdisk,format=raw,vdev=hda,access=rw,target=rbd:rbd/ubuntu1204.img' ]

Signed-off-by: David Scott <dave.scott@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>

hotplug: add openvswitch script

Based on Waldi's RFC at
http://lists.xen.org/archives/html/xen-devel/2012-09/msg00943.html

To use it set vif.default.script="vif-openvswitch" in /etc/xen/xl.conf or use
script=vif-openvswitch in the vif configuration.

Appears to do the right thing for PV and HVM guests (including tap devices)
and with stubdomains.

In order to support VLAN tagging and trunking the "bridge" specified in the
configuration can have a special syntax, that is:

BRIDGE_NAME[.VLAN][:TRUNK:TRUNK]

e.g.
- xenbr0.99
add the VIF to VLAN99 on xenbr0
- xenbr0:99:100:101
add the VIF to xenbr0 as a trunk port receiving VLANs 99, 100 & 101

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Bastian Blank <waldi@debian.org>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: dev@openvswitch.org

ns16550: delay resume until dom0 ACPI has a chance to run

Check for ioport access, before fully resuming operation, to avoid
spinning in __ns16550_poll when reading the LSR register returns 0xFF
on failing ioport access.

On some systems (like Lenovo T410, and some HP machines of similar vintage)
there is a SuperIO card that provides this legacy ioport on the LPC bus.

In this case, we need to wait for dom0's ACPI processing to run the proper
AML to re-initialize the chip, before we can use the card again.

This may cause a small amount of garbage to be written to the serial log
while we wait patiently for that AML to be executed.

This implementation limits the number of retries, to avoid a situation
where we keep trying over and over again, in the case of some other failure
on the ioport.

Signed-Off-By: Ben Guthro <benjamin.guthro@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

x86: remove IS_PRIV_FOR references

The check in guest_physmap_mark_populate_on_demand is redundant, since
its only caller is populate_physmap whose only caller checks the
xsm_memory_adjust_reservation hook prior to calling.

Add a new XSM hook for the other two checks since they allow privileged
domains to arbitrarily map a guest's memory.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (release perspective)

x86/hvm: convert access check for nested HVM to XSM

This adds an XSM hook for enabling nested HVM support, replacing an
IS_PRIV check. This hook is a partial duplicate with the xsm_hvm_param
hook, but using the existing hook would require adding the index to the
hook and would require the use of a custom hook for the xsm-disabled
case (using XSM_OTHER, which is less immediately readable) - whereas
adding a new hook retains the clarity of the existing code.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (release perspective)

cpupool: prevent a domain from moving itself

In the XEN_SYSCTL_CPUPOOL_OP_MOVEDOMAIN operation, the existing check
for domid == 0 should be checking that a domain does not attempt to
modify its own cpupool; fix this by using rcu_lock_remote_domain_by_id.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

x86/mwait_idle: support Haswell

This patch enables intel_idle to run on the next-generation Intel(R)
Microarchitecture code named "Haswell".

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Merge branch 'staging' of xenbits.xen.org:/home/xen/git/xen into staging

x86/mwait_idle: stop using driver_data for static flags

The (Linux) commit 4202735e8ab6ecfb0381631a0d0b58fefe0bd4e2
(cpuidle: Split cpuidle_state structure and move per-cpu statistics fields)
observed that the MWAIT flags for Cn on every processor to date were the
same, and created get_driver_data() to supply them.

Unfortunately, that assumption is false, going forward.
So here we restore the MWAIT flags to the cpuidle_state table.
However, instead restoring the old "driver_data" field,
we put the flags into the existing "flags" field,
where they probalby should have lived all along.

This patch does not change any operation.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

x86/EFI: pass boot services variable info to runtime code

EFI variables can be flagged as being accessible only within boot services.
This makes it awkward for us to figure out how much space they use at
runtime. In theory we could figure this out by simply comparing the results
from QueryVariableInfo() to the space used by all of our variables, but
that fails if the platform doesn't garbage collect on every boot. Thankfully,
calling QueryVariableInfo() while still inside boot services gives a more
reliable answer. This patch passes that information from the EFI boot stub
up to the efi platform code.

Based on a similarly named Linux patch by Matthew Garrett <matthew.garrett@nebula.com>.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

EFI: update error indicators

... from gnu-efi-3.0t. Decode a few of them in x86's PrintErrMesg().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

libxc: Add unsafe decompressors

Add decompressors based on hypervisor code. This are used in mini-os by
pv-grub.

This enables pv-grub to boot kernels compressed with e.g. xz, which are
becoming more common.

Signed-off-by: Bastian Blank <waldi@debian.org>
Adjusted to use terminology "unsafe" rather than "trusted" to indicate
that the user had better sanitise the data (or not care, as in stub
domains) as suggested by Tim Deegan. This was effectively a sed script.

Minimise the changes to hypervisor code by moving the "compat layer" into the
relevant libxc source files (which include the Xen ones).

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

xen/arm: do not use is_running to decide whether we can write directly to the LR registers

During context switch is_running is set for the next vcpu before the
gic state is actually saved.
This leads to possible nasty races when interrupts need to be injected
after is_running is set to the next vcpu but before the currently
running gic state has been saved from the previous vcpu.

Use current instead of is_running to check which one is the currently
running vcpu: set_current is called right before __context_switch and
schedule_tail with interrupt disabled.

Re-enabled interrupts after ctxt_switch_from, so that all the context
switch saving functions don't have to worry about receiving interrupts
while saving state.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

mini-os/x86-64 entry: check against nested events and try to fix up

In hypervisor_callback, check against event re-entrant.
If we came from the critical region in interrupt context,
try to fix up by coalescing the two stack frames.
The execution is resumed as if the second event never happened.

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

mini-os/x86-64 entry: defer RESTORE_REST until return

No need to do a RESTORE_REST at this point because if we saw pending
events after we enabled event delivery, we have to do a SAVE_REST again.
Instead, we do a "lazy" RESTORE_REST, deferring it until actual return.
The offset of saved-on-stack rflags register is changed as well.

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

mini-os/x86-64 entry: remove unnecessary event blocking

We don't need to block events here because:
- if we came from "hypervisor_callback", events are disabled at this point,
   no need to block again;
- if we came from "error_entry", we shouldn't touch event mask, for
   exception hanlding are meant to be interrupted by Xen events (virtual
   irq).

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

mini-os/x86-64 entry: code refactoring; no functional changes

Re-arrange assembly code blocks so that they are in called
order instead of jumping around, enhancing readability.
Macros are grouped together as well.

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

mini-os/x86-64 entry: define macros for registers partial save and restore

No functional changes.

For saving and restoring registers rbx, rbp, and r12-r15,
define and use macros SAVE_REST and RESTORE_REST respectively.

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

mini-os/x86-64 entry: code clean-ups; no functional changes

Make use of tabs and spaces consistent in arch/x86/x86_64.S

Signed-off-by: Xu Zhang <xzhang@cs.uic.edu>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

VMX: adjust correct table when there's no posted interrupt support

The caller of start_vmx() will overwrite hvm_funcs, so there's no point
in adjusting that table in start_vmx().

Signed-off-by: Jan Beulich <jbeulich@suse.com>

x86/S3: Fix cpu pool scheduling after suspend/resume

This review is another S3 scheduler problem with the system_state
variable introduced with the following changeset:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=269f543ea750ed567d18f2e819e5d5ce58eda5c5

Specifically, the cpu_callback function that takes the CPU down during
suspend, and back up during resume. We were seeing situations where,
after S3, only CPU0 was in cpupool0. Guest performance suffered
greatly, since all vcpus were only on a single pcpu. Guests under high
CPU load showed the problem much more quickly than an idle guest.

Removing this if condition forces the CPUs to go through the expected
online/offline state, and be properly scheduled after S3.

This also includes a necessary partial change proposed earlier by
Tomasz Wroblewski here:
http://lists.xen.org/archives/html/xen-devel/2013-01/msg02206.html

It should also resolve the issues discussed in this thread:
http://lists.xen.org/archives/html/xen-devel/2012-11/msg01801.html

Signed-off-by: Ben Guthro <benjamin.guthro@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

x86: remove IS_PRIV bypass on IRQ check

This prevents a process in dom0 from granting a domU access to an IRQ without
adding the IRQ to the domU's list of permitted IRQs. This operation currently
succeeds in dom0 but would fail if the device model were running in a stubdom,
so making the failure consistent should ease debugging of the device-model
stubdoms.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

x86: remove IS_PRIV access check bypasses

Several domctl functions dealing with rangesets contain a short-circuit
bypass if the domain is privileged. Since the construction of domain 0
permits access to all I/O ranges, the call to irq_access_permitted will
normally return true even without the IS_PRIV check, and the presence of
the IS_PRIV check prevents the creation of a privileged domain without
access to specific devices or IO memory ranges.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

x86: fix various issues with handling guest IRQs

- properly revoke IRQ access in map_domain_pirq() error path
- don't permit replacing an in use IRQ
- don't accept inputs in the GSI range for MAP_PIRQ_TYPE_MSI
- track IRQ access permission in host IRQ terms, not guest IRQ ones
(and with that, also disallow Dom0 access to IRQ0)

This is CVE-2013-1919 / XSA-46.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

x86: clear EFLAGS.NT in SYSENTER entry path

... as it causes problems if we happen to exit back via IRET: In the
course of trying to handle the fault, the hypervisor creates a stack
frame by hand, and uses PUSHFQ to set the respective EFLAGS field, but
expects to be able to IRET through that stack frame to the second
portion of the fixup code (which causes a #GP due to the stored EFLAGS
having NT set).

And even if this worked (e.g if we cleared NT in that path), it would
then (through the fail safe callback) cause a #GP in the guest with the
SYSENTER handler's first instruction as the source, which in turn would
allow guest user mode code to crash the guest kernel.

Inject a #GP on the fake (NULL) address of the SYSENTER instruction
instead, just like in the case where the guest kernel didn't register
a corresponding entry point.

This is CVE-2013-1917 / XSA-44.

Reported-by: Andrew Cooper <andrew.cooper3@citirx.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

arm: vgic: fix race in vgic_vcpu_inject_irq

The initial check for a still pending interrupt (!list_empty(&n->inflight))
needs to be covered by the vgic lock to avoid trying to insert the IRQ into the
inflight list simultaneously on 2 pCPUS. Expand the area covered by the lock
appropriately.

Also consolidate the unlocks on the exit path into one location.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

arm64: correct secondary CPU bringup

The current cpuid is held in x22 not x12.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

arm: gic: implement IPIs using SGI mechanism

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

VMX: Use posted interrupt to deliver virutal interrupt

Deliver virtual interrupt through posted way if posted interrupt
is enabled.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Jun Nakajima <jun.nakajima@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)

x86/HVM: Call vlapic_set_irq() to delivery virtual interrupt

Move kick_vcpu into vlapic_set_irq. And call it to deliver virtual interrupt
instead set vIRR directly.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)

VMX: Add posted interrupt supporting

Add the supporting of using posted interrupt to deliver interrupt.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Jun Nakajima <jun.nakajima@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)

VMX: Turn on posted interrupt bit in vmcs

Turn on posted interrupt for vcpu if posted interrupt is avaliable.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Jun Nakajima <jun.nakajima@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)

VMX: Detect posted interrupt capability

Check whether the Hardware supports posted interrupt capability.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Jun Nakajima <jun.nakajima@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com> (from a release perspective)

libxl: properly initialize device structures

This avoids returning unallocated memory in the libxl_device_vtpm
structure in libxl_device_vtpm_list, and uses libxl_device_nic_init
instead of memset when initializing libxl_device_nics.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: postpone backend name resolution

This adds a backend_domname field in libxl devices that contain a
backend_domid field, allowing either a domid or a domain name to be
specified in the configuration structures. The domain name is resolved
into a domain ID in the _setdefault function when adding the device.
This change allows the backend of the block devices to be specified
(which previously required passing the libxl_ctx down into the block
device parser), and will simplify specification of backend domains in
other users of libxl.

The check on run_hotplug_scripts in parse_config_data is removed because
it is a duplicate of the one in libxl__device_nic_setdefault, and is
removed here because it no longer has the resolved domain ID to check.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- reran flex ]

docs: Don't use non-portable cp -d flag unnecessarily (no links are involved)

cp -d is a GNU extension, and only useful for links, which don't seem
to exist in the directories being copied, so just remove said flag.

Signed-off-by: Patrick Welche <prlw1@cam.ac.uk>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

configure: test(1) uses = not == for string comparison

Avoids a bash-ism.

Signed-off-by: Patrick Welche <prlw1@cam.ac.uk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

tools/ocaml: clean META files

As META files are generated from META.in files, they should be cleaned
by clean rules.

Signed-off-by: Vincent Bernardoff <vincent.bernardoff@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Merge branch 'staging' of ssh://xenbits.xen.org/home/xen/git/xen into staging

docs: rearrange and update NUMA placement documentation

To include the new concept of NUMA aware scheduling and
describe its impact.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

xl: add node-affinity to the output of `xl list`

Node-affinity is now something that is under (some) control of the
user, so show it upon request as part of the output of `xl list'
by the `-n' option.

Re the patch, the print_bitmap() related hunk is _mostly_ code motion,
although there is a very minor change in the code, basically to allow
using the function for printing both cpu and node bitmaps (as, in case
all bits are sets, it used to print "any cpu", which doesn't fit the
nodemap case).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: automatic placement deals with node-affinity

Which basically means the following two things:
1) during domain creation, it is the node-affinity of
    the domain --rather than the vcpu-affinities of its
    VCPUs-- that is affected by automatic placement;
2) during automatic placement, when counting how many
    VCPUs are already "bound" to a placement candidate
    (as part of the process of choosing the best
    candidate), both vcpu-affinity and node-affinity
    are considered.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

libxl: optimize the calculation of how many VCPUs can run on a candidate

For choosing the best NUMA placement candidate, we need to figure out
how many VCPUs are runnable on each of them. That requires going through
all the VCPUs of all the domains and check their affinities.

With this change, instead of doing the above for each candidate, we
do it once for all, populating an array while counting. This way, when
we later are evaluating candidates, all we need is summing up the right
elements of the array itself.

This reduces the complexity of the overall algorithm, as it moves a
potentially expensive operation (for_each_vcpu_of_each_domain {})
outside from the core placement loop, so that it is performed only
once instead of (potentially) tens or hundreds of times. More
specifically, we go from a worst case computation time complaxity of:

O(2^n_nodes) * O(n_domains*n_domain_vcpus)

To, with this change:

O(n_domains*n_domains_vcpus) + O(2^n_nodes) = O(2^n_nodes)

(with n_nodes<=16, otherwise the algorithm suggests partitioning with
cpupools and does not even start.)

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxl: allow for explicitly specifying node-affinity

By introducing a nodemap in libxl_domain_build_info and
providing the get/set methods to deal with it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

libxc: allow for explicitly specifying node-affinity

By providing the proper get/set interface and wiring them
to the new domctl-s from the previous commit.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

xen: allow for explicitly specifying node-affinity

Make it possible to pass the node-affinity of a domain to the hypervisor
from the upper layers, instead of always being computed automatically.

Note that this also required generalizing the Flask hooks for setting
and getting the affinity, so that they now deal with both vcpu and
node affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: Keir Fraser <keir@xen.org>

xen: sched_credit: let the scheduler know about node-affinity

As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they prefer to. While respecting vcpu-affinity remains mandatory,
node-affinity is not that strict, it only expresses a preference,
although honouring it will bring significant performance benefits
(especially as compared to not having any affinity at all).

This change modifies the VCPUs load balancing algorithm (for the
credit scheduler only), introducing a two steps logic. During the
first step, we use both the vcpu-affinity and the node-affinity
masks (by looking at their intersection). The aim is giving precedence
to the PCPUs where the domain prefers to run, as expressed by its
node-affinity (with the intersection with the vcpu-afinity being
necessary in order to avoid running a VCPU where it never should).
If that fails in finding a valid PCPU, the node-affinity is just
ignored and, in the second step, we fall back to using cpu-affinity
only.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

xen: sched_credit: when picking, make sure we get an idle one, if any

The pcpu picking algorithm treats two threads of a SMT core the same.
More specifically, if one is idle and the other one is busy, they both
will be assigned a weight of 1. Therefore, when picking begins, if the
first target pcpu is the busy thread (and if there are no other idle
pcpu than its sibling), that will never change.

This change fixes this by ensuring that, before entering the core of
the picking algorithm, the target pcpu is an idle one (if there is an
idle pcpu at all, of course).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Keir Fraser <keir@xen.org>

xen, libxc: introduce xc_nodemap_t

And its handling functions, following suit from xc_cpumap_t.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: Keir Fraser <keir@xen.org>

xen, libxc: rename xenctl_cpumap to xenctl_bitmap

More specifically:
1. replaces xenctl_cpumap with xenctl_bitmap
2. provides bitmap_to_xenctl_bitmap and the reverse;
3. re-implement cpumask_to_xenctl_bitmap with
bitmap_to_xenctl_bitmap and the reverse;

Other than #3, no functional changes. Interface only slightly
afected.

This is in preparation of introducing NUMA node-affinity maps.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Acked-by: Keir Fraser <keir@xen.org>

x86: allow VCPUOP_register_vcpu_info to work again on PVHVM guests

For details on the hypercall please see commit
c58ae69360ccf2495a19bf4ca107e21cf873c75b (VCPUOP_register_vcpu_info) and
the c/s 23143 (git commit 6b063a4a6f44245a727aa04ef76408b2e00af9c7)
(x86: move pv-only members of struct vcpu to struct pv_vcpu)
that introduced the regression.

The current code allows the PVHVM guest to make this hypercall.
But for PVHVM guest it always returns -EINVAL (-22) for Xen 4.2
and above. Xen 4.1 and earlier worked.

The reason is that the check in map_vcpu_info would fail
at:

  if ( v->arch.vcpu_info_mfn != INVALID_MFN )

The reason is that the vcpu_info_mfn for PVHVM guests ends up by
defualt with the value of zero (introduced by c/s 23143).

The code in vcpu_initialise which initialized vcpu_info_mfn to a
valid value (INVALID_MFN), would never be called for PVHVM:

    if ( is_hvm_domain(d) )
    {
        rc = hvm_vcpu_initialise(v);
        goto done;
    }

    v->arch.pv_vcpu.vcpu_info_mfn = INVALID_MFN;

while previously it would be:

     v->arch.vcpu_info_mfn = INVALID_MFN;

[right at the start of the function in Xen 4.1]

This fixes the problem with Linux advertising this error:
register_vcpu_info failed: err=-22

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xl: Fix 'free_memory' to include outstanding_claims value.

Updating to make it clear that free_memory reported by 'xl info'
is influenced by the outstanding claim value. That is the free
memory that will be available to the host once all outstanding
claims have been completed. This modifies the behavior that the
patch titled "xl: 'xl info' print outstanding claims if enabled
(claim_mode=1 in xl.conf)" had - which reported the
outstanding claims and nothing else.

The free_pages as reported by the hypervisor is the currently
available count of pages on the heap. The outstanding pages is
the total amount of pages reserved for guests (so not taken from
the heap yet). As guests are being populated the memory from the
heap shrinks and the outstanding count of pages decreases.
The total memory used for guests increases.

As the available count of pages on the heap and outstanding
claims are intertwined, report the amount of free memory available
to be a combination of that. That is free heap memory minus the
outstanding pages.

We also make some odd choices in reporting. By default we will
only display 'outstanding_claims' if the claim_mode is enabled
in the global configuration file. However, if there are outstanding
claims, we will ignore the claim_mode and report these values.

Suggested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xl: 'xl claims' print outstanding per domain claims

This is similar to "xl: 'xl info' print outstanding claims if enabled
(claim_mode=1 in xl.conf)" which exposes the global claim value.

This patch provides the value of the currently outstanding pages
claimed for each domains. This is per domain value which is added
to the global claim value which influences the hypervisors' MM system.

When a claim call is done, a reservation for a specific amount of pages
is set (and this patch lists said number) and also a global value is
incremented. This global value is then reduced as the domain's memory
is populated and eventually reaches zero.

The toolstack (libxc) also sets the domain's claim to zero when the population
of memory has completed as an extra step. Any call to destroy the domain
will also set the domain's claim to zero.

If the reservation cannot be meet the guest creation fails immediately
instead of taking seconds or minutes (depending on the size of the guest)
while the toolstack populates memory.

See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode'
global config" for details on how it is implemented.

The value fluctuates quite often so the value is stale once it is provided
to the user-space.  However it is useful for diagnostic purposes.

It is printed irregardless of global "claim_mode" option in xl.conf(5).
That is b/c the user might have enabled, launched a guest, and then
disabled the option - and we should still report the correct outstanding
claim value.  The 'man xl' shows the details of this argument.
The output is close to what 'xl list' looks like:

Name                                        ID   Mem VCPUs      State   Time(s)  Claimed
Domain-0                                     0  2047     4     r-----      19.7     0
OL5                                          2  2048     1     --p---       0.0   847
OL6                                          3  1024     4     r-----       5.9     0
Windows_XP                                   4  2047     1     --p---       0.0  1989

[In which it can be seen that the OL5 guest still has 847MB of claimed
memory (out of the total 2048MB where 1191MB has been allocated to
the guest).]

Please note that the 'Mem' column has the cumulative value of outstanding
claims and the total amount of memory that has been allocated to the guest.

[v1: claims, not claim-list]
[v2: Add outstanding and current memkb in the output list]
[v3: Clairy docs and relax some checks]
[v4: Removed comments about guest config memory being the same as 'Mem']
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xl: export 'outstanding_pages' value from xcinfo

This patch provides the value of the currently outstanding pages
claimed for a specific domain. This is a value that influences
the global outstanding claims value (See patch: "xl: 'xl info'
print outstanding claims if enabled") returned via
xc_domain_get_outstanding_pages hypercall. This domain value
decrements as the memory is populated for the guest and
eventually reaches zero.

With this patch it is possible to utilize this field.

Acked-by: Ian Campbell <ian.campbell@citrix.com>
[v2: s/unclaimed/outstanding/ per Tim's suggestion]
[v3: Don't use SXP printout file per Ian's suggestion]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xc: export outstanding_pages value in xc_dominfo structure.

This patch provides the value of the currently outstanding pages
claimed for a specific domain. This is a value that influences
the global outstanding claims value (See patch: "xl: 'xl info'
print outstanding claims if enabled") returned via
xc_domain_get_outstanding_pages hypercall. This domain value
decrements as the memory is populated for the guest and
eventually reaches zero.

This patch is neccessary for "xl: export 'outstanding_pages' value
from xcinfo" patch.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v2: s/unclaimed_pages/outstanding_pages/ per Tim's suggestion]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xl: 'xl info' print outstanding claims if enabled (claim_mode=1 in xl.conf)

This patch provides the value of the currently outstanding pages
claimed for all domains. This is a total global value that influences
the hypervisors' MM system.

When a claim call is done, a reservation for a specific amount of pages
is set and also a global value is incremented. This global value is then
reduced as the domain's memory is populated and eventually reaches zero.
The toolstack (libxc) also sets the domain's claim to zero when the population
of memory has completed as an extra step. Any call to destroy the domain
will also set the domain's claim to zero.

If the reservation cannot be meet the guest creation fails immediately
instead of taking seconds or minutes (depending on the size of the guest)
while the toolstack populates memory.

See patch: "xl: Implement XENMEM_claim_pages support via 'claim_mode'
global config" for details on how it is implemented.

The value fluctuates quite often so the value is stale once it is provided
to the user-space. However it is useful for diagnostic purposes.

It is only printed when the global "claim_mode" option in xl.conf(5)
is set to enabled (1). The 'man xl' shows the details of this item.

[v1: s/unclaimed/outstanding/]
[v2: Made libxl_get_claiminfo return just MemKB suggested by Ian Campbell]
[v3: Made libxl_get_claininfo return MemMB to conform to the other values printed]
[v4: Improvements suggested by Ian Jackson, also added docs to xl.pod.1]
[v5: Clarify how claims are cancelled, split >72 characters - Ian Jackson]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xl: Implement XENMEM_claim_pages support via 'claim_mode' global config

The XENMEM_claim_pages hypercall operates per domain and it should be
used system wide. As such this patch introduces a global configuration
option 'claim_mode' that by default is disabled.

If this option is enabled then when a guest is created there will be an
guarantee that there is memory available for the guest. This is an
particularly acute problem on hosts with memory over-provisioned guests
that use tmem and have self-balloon enabled (which is the default option
for them). The self-balloon mechanism can deflate/inflate the balloon
quickly and the amount of free memory (which 'xl info' can show) is stale
the moment it is printed. When claim is enabled a reservation for the
amount of memory ('memory' in guest config) is set, which is then reduced
as the domain's memory is populated and eventually reaches zero.

If the reservation cannot be meet the guest creation fails immediately
instead of taking seconds/minutes (depending on the size of the guest)
while the guest is populated.

Note that to enable tmem type guests, one needs to provide 'tmem' on the
Xen hypervisor argument and as well on the Linux kernel command line.

There are two boolean options:

(0) No claim is made. Memory population during guest creation will be
attempted as normal and may fail due to memory exhaustion.

(1) Normal memory and freeable pool of ephemeral pages (tmem) is used when
calculating whether there is enough memory free to launch a guest.
This guarantees immediate feedback whether the guest can be launched due
to memory exhaustion (which can take a long time to find out if launching
massively huge guests) and in parallel.

[v1: Removed own claim_mode type, using just bool, improved docs, all per
Ian's suggestion]
[v2: Updated the comments]
[v3: Rebase on top 733b9c524dbc2bec318bfc3588ed1652455d30ec (xl: add vif.default.script)]
[v4: Fixed up comments]
[v5: s/global_claim_mode/claim_mode/]
[v6: Ian Jackson's feedback: use libxl_defbool, better comments, etc]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>